Supplementary Material: Hardware Resilience Properties of Text-Guided Image Classifiers
This section contains supplementary material that provides additional details for the main paper.
Note that for the error-injection experiments, we perform single-bit flips only in the convolutional and linear layers of the neural network, in line with other work in this field. In this section, we provide visualizations of additional backbones. Figures 9 and 10 extend Figure 3 to more networks: the Y-axis shows the absolute value of the maximum neuron value observed in each layer, plotted along the X-axis. Next, Figures 11 and 12 extend Figure 4, showing the impact of our proposed technique on end-to-end network accuracy.
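As a minimal illustration of this fault model (a sketch only, not the paper's actual injection harness), a single-bit flip in a float32 weight can be simulated by toggling one bit of its IEEE-754 encoding:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip a single bit in the IEEE-754 float32 encoding of `value`.

    Illustrative sketch of the single-bit-flip fault model; real
    experiments would apply this to weights of the convolutional and
    linear layers of a trained network.
    """
    assert 0 <= bit < 32
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit  # toggle the chosen bit
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted

# Flipping the sign bit (bit 31) negates the value:
print(flip_bit(1.0, 31))   # -1.0
# Flipping a high exponent bit (bit 30 of 1.0) yields +inf, which is why
# bit flips in high-magnitude positions are the main resilience concern:
print(flip_bit(1.0, 30))   # inf
```

Since XOR is its own inverse, flipping the same bit twice recovers the original value, which makes injected faults easy to undo during sweeps.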
Scaling Diffusion Transformers Efficiently via $μ$P
Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($μ$P) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether $μ$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard $μ$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $μ$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$α$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $μ$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$μ$P enjoys robust HP transferability. Notably, DiT-XL-2-$μ$P with a transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $μ$P on text-to-image generation by scaling PixArt-$α$ from 0.04B to 0.61B parameters and MMDiT from 0.18B to 18B. In both cases, models under $μ$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$α$ and 3% of the consumption by human experts for MMDiT-18B. These results establish $μ$P as a principled and efficient framework for scaling diffusion Transformers.
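The HP transfer the abstract describes can be sketched with the standard $μ$P scaling rules for Adam (from the vanilla-Transformer $μ$P literature): tune the learning rate on a narrow proxy model, then scale the learning rate of matrix-like (hidden) weights inversely with the width multiplier while leaving vector-like parameters unchanged. The function below is a hypothetical illustration of that rule, not code from the paper:

```python
def mup_transfer(base_lr: float, base_width: int, target_width: int) -> dict:
    """Transfer a learning rate tuned on a narrow proxy model to a wider
    model, following standard muP scaling rules for Adam.

    Hypothetical helper for illustration; the exact names and training
    setup here are not taken from the paper.
    """
    mult = target_width / base_width  # width multiplier m
    return {
        # matrix-like (hidden) weights: learning rate scales as 1/m under Adam
        "hidden_lr": base_lr / mult,
        # vector-like params (biases, LayerNorm gains): learning rate unchanged
        "vector_lr": base_lr,
        # width multiplier, also used to scale the output-layer multiplier by 1/m
        "width_mult": mult,
    }

# Example: tune at width 128 on a small proxy, deploy at width 1152 (DiT-XL):
hp = mup_transfer(3e-4, 128, 1152)
```

Because the optimum of the transferred HP is (approximately) width-independent under $μ$P, a single small sweep on the proxy suffices, which is where the reported tuning-cost savings come from.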